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Abstract 


Recent studies have advocated biomonitoring using DNA techniques. In this study, two high-throughput sequenc- 
ing (HTS)-based methods were evaluated: amplicon metabarcoding of the cytochrome C oxidase subunit I (COI) 
mitochondrial gene and gene enrichment using MYbaits (targeting nine different genes including COI). The gene- 
enrichment method does not require PCR amplification and thus avoids biases associated with universal primers. 
Macroinvertebrate samples were collected from 12 New Zealand rivers. Macroinvertebrates were morphologically 
identified and enumerated, and their biomass determined. DNA was extracted from all macroinvertebrate samples 
and HTS undertaken using the ILLUMINA MisEQ platform. Macroinvertebrate communities were characterized from 
sequence data using either six genes (three of the original nine were not used) or just the COI gene in isolation. The 
gene-enrichment method (all genes) detected the highest number of taxa and obtained the strongest Spearman rank 
correlations between the number of sequence reads, abundance and biomass in 67% of the samples. Median detec- 
tion rates across rare (<1% of the total abundance or biomass), moderately abundant (1-5%) and highly abundant 
(>5%) taxa were highest using the gene-enrichment method (all genes). Our data indicated primer biases occurred 
during amplicon metabarcoding with greater than 80% of sequence reads originating from one taxon in several sam- 
ples. The accuracy and sensitivity of both HTS methods would be improved with more comprehensive reference 
sequence databases. The data from this study illustrate the challenges of using PCR amplification-based methods for 
biomonitoring and highlight the potential benefits of using approaches, such as gene enrichment, which circumvent 
the need for an initial PCR step. 
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Introduction 


Environmental biomonitoring has been used for decades 
to characterize and monitor the quality of an environ- 
ment. The organisms present in a sample, and their 
abundance, biomass and health can all act as robust indi- 
cators of the quality of their environment (Duggan et al. 
2002; Sharley et al. 2004; Hodkinson & Jackson 2005; 
Baird & Hajibabaei 2012). Analysis of biological samples 
commonly requires taxonomic identification and 
enumeration, which traditionally involved microscopy 
or microbiological methods. These techniques can be 
time-consuming and laborious, and are dependent on 
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taxonomic expertise (Wood ef al. 2013). In recent years, 
molecular approaches have been promoted as a solution 
to overcome many of these challenges (Baird & 
Hajibabaei 2012; Kelly et al. 2014). 

Recent advances in DNA sequencing technologies 
have very rapidly enhanced the potential of using molec- 
ular approaches for biomonitoring (Shokralla et al. 2012). 
High-throughput sequencing (HTS) has now made it 
possible to recover an unpreceded number of DNA 
sequences directly from environmental samples (e.g. 
Shokralla et al. 2012; Thomsen & Willerslev 2015; de Var- 
gas et al. 2015). Amplicon metabarcoding enables the 
rapid identification of a species by matching short gene 
fragments (from HTS data) to a DNA sequence reference 
library, which is usually generated from morphologically 
verified specimens (Hajibabaei et al. 2011; Taberlet et al. 
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2012). Amplicon metabarcoding has many advantages 
over traditional morphology-based methods or other 
molecular techniques such as clone libraries as it can pro- 
duce large amounts of data within relatively short time 
frames and does not rely heavily on taxonomic expertise 
(Ji et al. 2013). Amplicon metabarcoding has proven very 
effective for characterizing eukaryotic and prokaryotic 
biodiversity in a range of terrestrial (Senstebe et al. 2010; 
Andersen et al. 2012; Yang et al. 2014), marine (Chariton 
et al. 2010; Creer et al. 2010; Pochon et al. 2013, 2015a; de 
Vargas et al. 2015) and freshwater (Hajibabaei et al. 2011; 
Brasell et al. 2014) habitats. An increasing number of 
studies are now showing amplicon-metabarcoding’s 
applicability for routine biomonitoring (e.g. Kermarrec 
et al. 2014; Pawlowski et al. 2014; Dowle et al. 2015; 
Pochon et al. 2015b; Vivien et al. 2015). Most metabarcod- 
ing biomonitoring studies to date have focused on how 
organism-specific HTS data (e.g. bacteria, foraminifera 
and general eukaryotes) compare to other abiotic and 
biotic variables (e.g. sediment redox, macroinfauna 
abundance; Pawlowski et al. 2014; Dowle et al. 2015; 
Lejzerowicz et al. 2015) at the study sites. Researchers 
have paralleled morphological and HTS-based app- 
roaches, for example Lindeque et al. 2013; but few 
studies have robustly compared abundance data 
between the two (Visco et al. 2015). Of those studies most 
have focused on misidentified or undetected organisms 
(Hajibabaei et al. 2011). 

The mitochondrial cytochrome C oxidase subunit I 
(COI) gene has been widely used for identifying a large 
diversity of organisms. As of 2015, the Barcode of Life 
COI database (http://www.boldsystems.org) contains 
sequences for more than four million specimens from 
over 200 000 species, more than is available for any other 
gene. Metabarcoding of biological communities is 
critically dependent on successful amplification of the 
barcode region(s) for all relevant taxa in a sample. The 
Folmer primers (Folmer et al. 1994) were originally 
described as ‘universal’ (i.e. will amplify COI from a 
wide variety of taxa). Recent studies suggest primer 
incompatibility with some taxa resulting in under-repre- 
sentation of these taxa and therefore reducing the useful- 
ness of COI for metabarcoding (Clarke et al. 2014; Deagle 
et al. 2014). 

Despite these limitations, COI has been used for a 
number of metabarcoding studies (e.g. (Hajibabaei et al. 
2011; Ji et al. 2013; Yang et al. 2014; Zaiko et al. 2015). 
These studies have primarily used the 454 pyrosequenc- 
ing platform. The major advantage of 454 pyrosequenc- 
ing is its long read length. However, 454 has a number of 
limitations including; a relatively high cost for reagents 
which may limit its capability for cost-effective analysis 
of large environmental data sets (Claesson et al. 2010; 
Loman et al. 2012), fewer sequences are generated 
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compared with other platforms (e.g. ILLUMINA MISEQ or 
HISEQ), and it is likely that 454 will not be supported in 
the next few years (Bio-IT 2013). Alternative HTS plat- 
forms include the Illumina system, and the Ion Torrent, 
however, the COI barcode region (Folmer et al. 1994) is 
too long to be completely sequenced by these platforms. 
Some metabarcoding studies targeting COI have also 
used modified primers covering a shorter region (Meus- 
nier et al. 2008; Hajibabaei et al. 2011, 2012; Zeale et al. 
2011; Leray et al. 2013). 

Other solutions to the reduced read length of non-454 
instruments, and lack of primer universality include the 
following (i) using different markers (e.g. nuclear 18S 
rDNA; Pawlowski et al. (2012) — however taxonomic 
resolution may be limited using this gene, (Pochon et al. 
2013), (ii) using several barcode markers for each sample 
(Deagle et al. 2009; De Barba et al. 2014), (iii) using modi- 
fied primers with inosine nucleotides (complementary to 
all four DNA nucleotides) at variable binding sites 
(Geller et al. 2013; Deagle et al. 2014) or (iv) using a tech- 
nique that does not require universal primers. 

Genetic methods not relying on PCR to characterize 
communities include ‘shotgun sequencing’ (Simon & 
Daniel 2011; Wang ef al. 2013), although there are chal- 
lenges with retrieving COI sequences directly from total 
genomic samples, or mitochondrial enrichment during 
DNA extraction followed by ultra-deep HTS and bioin- 
formatics to retrieve COI barcode sequences (Zhou et al. 
2013). Both methods require ultra-deep sequencing that 
is currently too expensive for routine environmental 
biomonitoring. A third potential technique is gene 
enrichment without PCR amplification. Gene enrichment 
uses an enormous number (c. 20 000) of short (c. 
100-mer) synthetic DNA probes, designed from reference 
sequences, that are complementary to regions in the gen- 
omes of the target taxa. The complementary sequences 
are bound to a substrate that enables the capture of DNA 
target regions (Maricic et al. 2010; Mertes et al. 2011). The 
captured DNA can then be sequenced using HTS with- 
out the need for universal primers. Probe enrichment 
techniques are now commonly used for human genomic 
research (Blow 2009), phylogenetic and evolutionary 
studies (McCormack et al. 2013; Giarla & Esselstyn 2015) 
and by researchers working on ancient DNA (Briggs 
et al. 2009; Horn 2012). To our knowledge, gene 
enrichment has not been applied to environmental 
biomonitoring. 

We hypothesized that because of the lack of truly uni- 
versal primer sets, and inherent biases with PCR-based 
approaches, a gene-enrichment technique would provide 
a better estimate of the abundance and biomass of target 
organisms in environmental samples compared with 
PCR-based approaches. In this study, we focused on 
macroinvertebrates from river ecosystems as they are 
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used worldwide to measure water quality (e.g. Fleituch 
et al. 2002; Chessman 2003; Buffagni et al. 2004). Several 
biotic indices generated from morphological identifica- 
tion of specimens have been developed to assess water 
quality. For example, the Macroinvertebrate Community 
Index (MCI; Stark 1985) has been used for the past three 
decades in New Zealand, and the Biological Monitoring 
Working Party (BMWP) Score System (BMWP 1978) is 
used in the United Kingdom. In addition to comparing 
the data generated by the two molecular HTS-based 
methods, we undertook a robust comparison of the HTS 
information with taxa abundance and biomass obtained 
using morphology and microscopy. We also evaluated 
how the data from all methods performed when incorpo- 
rated into the MCI. 


Materials and methods 


Sample collection and morphological identification 


Quantitative benthic macroinvertebrate samples were 
collected from twelve New Zealand rivers (Table S1, 
Supporting information) using a Surber sampler (0.1 m?, 
0.5 mm mesh), disturbing the substratum to a depth of 
c. 0.1 m. Macroinvertebrates were sorted into 3-mm 
length classes and identified to the lowest possible taxo- 
nomic level using the keys in Winterbourn et al. (1989) 
and counted. Macroinvertebrate abundances were calcu- 
lated by dividing macroinvertebrate numbers by the area 
sampled. Taxa were grouped by length, and dry weights 
estimated for each size class, using relationships from 
the literature (Sample et al. 1993; Towers et al. 1994). 
Macroinvertebrate biomass was calculated by multiply- 
ing abundance with mean dry weight. 


Reference sequence database construction 


A reference sequence database was generated for nine 
genes; cytochrome C oxidase subunits 1 and II (COI, 
COID, 12S, 16S and 18S ribosomal RNA (rRNA), elonga- 
tion factor 1 alpha (EFla), carbamoyl phosphate syn- 
thetase 2, aspartate transcarbamylase, dihydroorotase 
(CAD), RNA polymerase II (Pol II) and histone H3 (H3), 
covering the major macroinvertebrate groups in New 
Zealand (Stark & Maxted 2007a). Where possible, we 
made sure that groups identified via microscopy were 
represented in the sequence database; however, outside 
of COI, there were a limited number of sequences. 
Sequences were obtained from GenBank (National Cen- 
tre for Biotechnology Information) and public and pri- 
vate sequences from the Barcode of Life portal (http:// 
www.boldsystems.org/). 

Sequence coverage varied among genes and taxonomic 
groups. Plecoptera, Trichoptera and Ephemeroptera were 


well represented allowing, in most instances, classifica- 
tion to genus level. In contrast, Diptera and Mollusca were 
poorly covered and could not be assigned below Order. 
The most extensive data set was obtained for the COI gene 
(>1500 sequences). Representative sequences for the 
remaining were also available for: Ephemeroptera 12S 
rRNA, 16S rRNA, 18S rRNA, 28S rRNA and H3; Ple- 
coptera — 12S rRNA, 185 rRNA, 28S rRNA, COII and H3; 
Trichoptera EFla, CAD, POLII, 16S rRNA, 185 rRNA and 
28S rRNA. 

No sequence data were available for eight taxa (Aph- 
rophila, Archichauliodes, Elmidae, Eriopterini, Hydraeni- 
dae, Neocurupira, Stictocladius and Tanytarsus,), and the 
COI gene was sequenced for individuals of these taxa. 
Specimens were collected from rivers in Nelson, New 
Zealand. DNA was extracted using the PureLink geno- 
mic DNA extraction kit (Life Technologies, USA) follow- 
ing the manufacturer’s instructions. Polymerase chain 
reactions contained 6 uL i-Taq 2 x PCR master mix 
(Intron, Korea), 10 um of dgLCO1490 and dgHCO2198 
primers (Geller et al. 2013), c. 10 ng of template DNA 
and water to a final volume of 10 uL. Reaction conditions 
were as follows: 94 °C for 2 min followed by 34 cycles of 
94 °C for 30 s, 50 °C for 30 s, 68 °C for 30 s, with a final 
extension of 68 °C for 5 min. Amplicons of the correct 
size were sequenced bidirectionally using the BIGDYE TER- 
MINATOR v3.1 Cycle Sequencing Kit (Applied Biosystems, 
USA). 

There were no published sequences for, nor could we 
collect individuals, of some taxa. For these groups, 
sequences from a closely related taxon were used. For 
example, Collembola and Oligochaeta sequences from 
terrestrial species were incorporated into the database. 
However, it is likely that these sequences were relatively 
divergent from their freshwater counterparts. 


Genetic characterization of environmental samples 


Two methods of producing genetic libraries to charac- 
terize macroinvertebrate communities were tested. The 
first method used PCR to amplify the COI region of 
the mitochondrial genome (the so-called barcode 
region; Hebert et al. 2003). The second method used 
gene enrichment to capture target DNA from the envi- 
ronmental samples. 


DNA extraction of environmental samples 


The entire macroinvertebrate community from each sam- 
ple was homogenized (Omni GLH, Omni International) 
and DNA-extracted using the Powermax Soil Isolation 
Kit (MoBio Laboratories, USA) following the manufac- 
turer’s instructions. DNA was quantified using a Qubit 
3.0 Fluorometer (Life Technologies, USA). 
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Cytochrome C oxidase I amplification 


A region of the COI mitochondrial gene was amplified 
from the DNA using the degenerate ‘universal’ 
dgLCO1490 and dgHCO2198 primers (Folmer et al. 1994; 
Geller et al. 2013; Leray et al. 2013). Polymerase chain 
reactions contained 45 uL of Platinum PCR SuperMix 
High Fidelity (Invitrogen), 10 uM of each primer, 10- 
20 ng of template DNA and water to a final volume of 
50 uL. The reaction mixture was held at 94 °C for 2 min 
followed by 24 cycles of 94 °C for 30 s, 50 °C for 30 s, 
68 °C for 30 s, with a final extension of 68 °C for 5 min. 
Amplicons were purified using the Agencourt AMPure 
XP PCR Purification beads (Beckman Coulter) following 
the manufacturer’s instructions, and quantified using 
Qubit. 

Amplicon-metabarcoding libraries for HTS were 
constructed from the COI PCR amplicons using 
NEBNext® dual-index kit (New England BioLabs, MA, 
USA) following the manufacturer’s instructions. Samples 
were tagged with unique index combinations to allow 
pooling of samples on the HTS run. 


Gene enrichment 


Sequences from the nine gene data set were sent to 
MYcroarray (MI, USA) to design gene capture probes 
(MYbaits). MYbaits are biotinylated oligonucleotides that 
are complementary to the target sequences and are used 
to separate the DNA of the target taxa from nontarget 
DNA. A total of 20 000 probes were designed that were 
complementary to portions of all nine genes. 

Genomic DNA for gene enrichment was sheared into 
500 nucleotide long fragments via sonication using the 
Covaris (S-Series) instrument (Covaris, MA, USA). 
Library preparation for HTS was constructed from the 
sheared genomic DNA using NEBNext® dual-index kit 
(New England BioLabs, MA, USA) following the manu- 
facturer’s instructions. AMPure XP PCR Purification 
(optimized for 500 bp) was used to select 500 nucleotide 
long fragments in the genomic sample following the 
bead ratios specified in the nebNEXT kit. Samples were 
tagged with unique index combinations to allow pooling 
of samples on the HTS run. 

Target capture was undertaken on the fragmented 
genomic DNA with a 24 h hybridization following the 
MyYbaits protocol. MyYbaits capture single-stranded 
DNA; therefore, amplification of the second strand was 
completed on the MYbait beads using KAPA HiFi DNA 
Polymerase (Kapa Biosystems, USA) using the following 
conditions; 98 °C for 30 s, followed by 5 cycles of 98 °C 
for 20 s, 68 °C for 30 s and 72 °C for 45 s, with a final 
extension of 5 min at 72 °C. Primers attached to the I5 
and I7 ends of the Illumina library. 
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A systematic diagram of the workflow for the 
amplicon metabarcoding and gene-enrichment method- 
ologies is given in Fig. 1. 


High-throughput sequencing 


The COI PCR amplicon and gene-enrichment libraries 
were combined at a 1:3 ratio, due to uncertainty about 
sequencing depth required for gene-enrichment libraries. 
The libraries were sequenced using the 2 x 300 paired- 
end MiSeq reagent kit v3 (New Zealand Genomics Ltd., 
Auckland, New Zealand). 


Community characterization from sequence data 


Analysis of COI amplicon-metabarcoding data. A 
customized approach was required for sequence analysis 
because the PCR amplicons (708 nucleotides) were 
longer than the reads generated from HTS (600 nucleo- 
tides). Sequences from the environmental samples were 
analysed using rastac (Andrews 2010) and quality con- 
trol undertaken with soLexaga (Cox et al. 2010). A quality 
cut-off of 0.01 (Q > 20) in DYNAMICTRIM was used. 

The library construction method (adapter ligation) 
resulted in PCR products from both directions in the 
forward and reverse read. Sequences were separated 
into forward and reverse reads in  GENEIOUS 
6.1.6 (http://www.geneious.com; (Kearse et al. 2012). 
Sequences were sorted into one of three files; those 
starting with the forward primer, the reverse primer, or 
those not starting with a primer (subsequently dis- 
carded). The two files retained were reprocessed to 
ensure that only sequences with both a forward and 
reverse primer sequence were present. All retained for- 
ward and reverse sequences were de novo aligned 
using qume’s (Caporaso et al. 2010) align_seqs.py script 
with the MAFFT alignment method (Katoh et al. 2002; 
Katoh & Standley 2013). Alignments were checked in 
GENEIOUS and trimmed to 220 bp for forward reads and 
250 bp for reverse reads. Forward reads were 
shorter due to a region of repeated nucleotides that 
resulted in a drop in sequence quality. Individual 
forward reads were matched to their complementary 
reverse read to form CONTIGS. SOLEXAQA’s LengthSort 
tool used to remove contics less than 350 bp long. 
Chimeric sequences were removed using the iden- 
tify_chimeric_sequences.py (usearch) script in QIME. 

Taxonomic identification of the PCR amplicon 
sequences were performed using two bioinformatic 
approaches. The ‘PCR-Mot’ method used the 
Classify_seqs tool in MOTHUR 1.3.3.3 (Schloss et al. 2009). 
The COI section of reference database was aligned using 
the Muscle aligner (Edgar 2004) plug-in via GENEIous. The 
midsection of the COI sequence reference database was 
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removed as the ILLUMINA MISEQ platform used for HTS 
could only generate sequences <300 nucleotide long from 
each end of the 708 nucleotide long COI amplicons. 


Amplicon metabarcoding 

* PCR targeting a gene region 

* Requires ‘universal’ primers 

* Database required after sequencing 
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Fig. 1 Diagram showing the two molecular approaches used in this study. 
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database. Taxa represented by fewer than ten reads per 
sample were removed. 

The second classification method ‘PCR—Map’ used the 
short-read aligner BOWTIE2 2.2.3 (Langmead & Salzberg 
2012) and the COI reference database with the sequence 
files described above. Mapping was performed using the 
local alignment with the default settings. 


Analysis of Gene-enrichment data. Sequences obtained 
from invertebrate communities were analysed using 
FASTQC, and quality control undertaken with DYNAMICTRIM 
via SOLEXAQA, using the same quality cut-off (0.01) as the 
PCR amplicon samples. contics were constructed using 
the Make.contigs script in MoTHUR and conTics shorter 
than 250 bp removed by LengthSort in sOLEXAQA. CONTIGS 
were mapped to the reference sequence database using 
BOWTIE2 (via the local aligner). Species with incomplete 
rRNA gene sequences in the reference database were 
mapped to the closest sequence. This was problematic as 
it was not possible to distinguish these sequences from 
missing data; thus, all rRNA were removed from subse- 
quent analysis. This problem would be unlikely to occur 
were comprehensive rRNA database available. For the 
remaining genes (n = 6), two runs of BOWTIE2 using the 
local aligner setting were completed. The first run (all 
genes including COI) was undertaken for Plecoptera, Tri- 
choptera and Ephemeroptera with stringent settings (— 
ma 2 -L 32 -score-min L,1,0.3). Unmapped reads from 
the first run were saved, and a second mapping for all 
other taxa (to only COI sequences) was undertaken using 
more relaxed settings (-ma 2 -L 32 -score-min L,1,0.2). 
Taxa represented by fewer than ten reads per sample 
were removed. 

A second analysis was undertaken using only the COI 
sequences. All other genes were first removed using the 
local aligner on default setting in BowTlE2. The remaining 
reads were mapped against the COI database using the 
local aligner setting in BOWTIE2 (-ma 2-L 32 —score-min 
L,1,0.3). Taxa represented by less than ten reads per 
sample were removed. 


Statistical analysis 


Statistical analysis was undertaken in rR (R Development 
Core Team 2014). Spearman rank correlations were used 
to compare microscopy determined taxon density or bio- 
mass with the number of sequence reads from the four 
bioinformatics analyses of the two HTS methods. 
Rarefaction plots were generated using the R package 
Vegan (Oksanen et al. 2013) to examine the effect of 
sequencing depth on species detection. 

To compare the performance of the different methods 
for detecting rare, moderately or highly abundant taxa, 
taxa from each environmental sample were categorized 
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into <1%, <5% or >5% taxon abundance or biomass. To 
assess how the HTS and morphological data compared 
when converted into a common biotic index, the 
Macroinvertebrate Community Index (MCI) was calcu- 
lated as described in (Stark 1993) for all 12 samples char- 
acterized by morphology or genetic methods. The MCI 
uses the presence of benthic invertebrates with differing 
sensitivities to organic pollution and nutrient enrichment 
to assess water quality. Taxa are assigned a tolerance to 
pollution ranging from 1 (very tolerant) to 10 (very sensi- 
tive). These values are then summed to give a MCI score 
for each site. A value for a site >119 indicates excellent, 
100-119 good, 80-99 fair and <80 poor, water quality 
(Stark & Maxted 2007a). 


Results 


The gene-enrichment method produced up to one order 
of magnitude more sequences than the PCR amplicon 
method after quality control and filtering. The gene- 
enrichment method produced a mean of 503 122 and 
454 409 sequences per sample for all genes (GE-AIl) and 
just COI (GE-COD), respectively, compared with a mean 
of 29 557 for the COI PCR amplicon method (Table 1). 
The number of taxa morphologically identified per 
sample varied from 5 to 24 (Table 2). The GE-All method 
detected the highest number of taxa for 9 of the 12 
samples and detected the highest number of taxa among 
the genetic methods (Table 2).There was no consistent 


Table 1 Number of raw, and final reads after quality filtering 
(QF), and removal of short reads from the COI amplicon-meta- 
barcoding library (length sorting, LS) for each sample. Both COI 
amplicon-metabarcoding analyses used the same final data set. 
The final values for the gene-enrichment samples included 
removal of sequences where mapping to the database was not 
possible 





Amplicon 
metabarcoding Gene-enrichment Gene-enrichment 
All COI 
Final-QF 

FinalQF & LS Final Final 
RS4 32095 26 674 395 584 385 823 
RS10 22 943 19 745 520 423 452 071 
RS11 42207 35875 558 056 486 657 
RS12 35008 27 284 688 019 622.552 
RS13 45675 33 390 631 078 582 643 
RS14 37354 29143 409 543 401 024 
RS15 103 604 27777 412 500 336 009 
RS16 47947 40 704 418 036 366 310 
RS17 40603 34 564 554 468 479 311 
RS18 33097 28 081 474 399 437 169 
RS19 24841 21517 663 298 609 613 
RS20 37035 29 926 312 067 293 729 


COL cytochrome c oxidase subunit I. 
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Table 2 Number of macroinvertebrate taxa identified using each method. The number outside the brackets shows those identified 
using microscopy and high-throughput sequencing (HTS) data. The number in brackets shows additional taxa identified from HTS data 
but not identified by microscopy. Bold values show the genetic technique that identified the most taxa that were also identified by 





microscope 
Amplicon-metabarcoding Amplicon-metabarcoding Gene-enrichment Gene-enrichment 

Sample Microscope MOTHUR mapped All COI 
RS4 5 3 (1) 3 (2) 4 (2) 4 (2) 
RS10 16 12 (0) 10 (2) 13 (1) 13 (1) 
RS11 24 13 (1) 9 (0) 18 (1) 18 (1) 
RS12 21 10 (1) 12 (1) 20 (2) 16 (1) 
RS13 13 13 (4) 13 (1) 13 (0) 13 (0) 
RS14 13 6 (4) 7 (3) 13 (2) 10 (3) 
RS15 19 14 (4) 9 (3) 13 (1) 14 (3) 
RS16 19 6 (0) 5 (0) 11 (3) 9 (2) 
RS17 16 6 (2) 5 (1) 11 (1) 9 (0) 
RS18 22 15 (3) 12 (1) 17 (2) 15 (3) 
RS19 9 2 (1) 2 (0) 8 (2) 8 (0) 
RS20 14 10 (8) 9 (4) 9 (0) 10 (0) 


COI = cytochrome oxidase I. 


pattern to the number of taxa detected by each method 
with morphology detecting more taxa than genetic meth- 
ods from some samples (e.g. RS11), and genetic methods 
detecting more taxa than morphology for other samples 
(e.g. the amplicon-metabarcoding method with the 
MOTHUR analysis on sample RS20). All genetic techniques 
detected the complete set of morpho-taxa for sample 
RS13. Some taxa were detected genetically but were not 
detected using microscopy. In general, the PCR-Mot 
method detected the highest number of taxa not identi- 
fied using microscopy (Table 2). 

A total of 46 taxa were identified from the samples 
using microscopy and HTS methods (Table S2, Support- 
ing information). The proportion of taxon sequences 
from both gene-enrichment techniques more closely 
matched biomass (as determined by microscopy) com- 
pared with the proportion of taxon sequences from the 
amplicon-metabarcoding technique, while no genetic 
methods matched closely to abundance (Fig. 2). In some 
samples, the taxon sequence proportions obtained using 
the amplicon-metabarcoding technique were dominated 
by a single taxon, for example, Orthocladiinae in RS4, 
Hydrobiosis in RS12 and Archichauliodes in RS19 and 20 
(Fig. 2; Table S2, Supporting information). Some taxa, 
such as Pycnocentrodes — which accounted for 29% of the 
RS12 community when using microscopic abundance, 
were not detected above 3% in the samples using either 
of the genetic techniques (Fig. 2). Conversely, some com- 
munities had taxa that accounted for a relatively high 
percentage of the sequence reads but were detected at 
very low levels with microscopy (e.g. Nesameletus in 
RS12 accounted for 15% and 28% of the gene enrichment 
with GE-all and GE-COI, respectively, but represented 


only <0.1% and 2.1% of abundance and biomass by 
microscopy; Fig. 2). 

The correlations between sample abundance and bio- 
mass determined microscopically, and the number of 
sequence reads generated using the four genetic meth- 
ods, showed that the highest R* for each sample was 
obtained using GE-all for 7 of the 12 samples (Table 3). 
For one sample (R54), the highest R? was obtained using 
PCR-Mot method (Table 3). The correlation between the 
two PCR amplicon methods and the microscope analysis 
was very weak (R*<0.5) for 41 of 48 correlations 
(Table 3). The correlations were much stronger for the 
majority of the two gene-enrichment techniques (25 of 48 
with R?> 0.5; Table 3). Among the gene-enrichment 
techniques, the highest R? was usually obtained for the 
biomass data (20 of 24; Table 3). 

We assessed the detectability of rare (defined as less 
than 1% of the total abundance or biomass), moderately 
abundant (1-5%) and highly abundant (>5%) taxa with 
each of the genetic methods. The median detection rate 
of both abundance and biomass for all classes was high- 
est with the GE-All method, closely followed by the 
GE-COI (Fig. 3a,b). With the exception of highly abun- 
dant taxa, PCR-Mot performed better than the PCR-Map 
analysis. Among genetic methods, the lowest median 
proportion of taxa detected and the highest variability 
occurred in the rare taxa class for both abundance and 
biomass (Fig. 3a,b). 

A comparison of the MCI scores generated for each 
sample using microscopy with the scores generated by 
the genetic methods resulted in reclassification of seven 
sites (Table 4). Most of the reclassifications were due to 
small changes that moved a site that was near the edge 
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Fig. 2 The proportional abundance and biomass of taxa (>3% per sample) at 6 of the 12 sampling sites determined by microscopy and 
the relative frequency of sequence reads from those samples using the four genetic approaches (see text for details). These sites were 
chosen to represent a cross-section of samples with differing taxon and community composition. GE-ALL = gene enrichment with all 
genes included in analysis, GE-COI = gene-enrichment technique with only the cytochrome C oxidase subunit I gene included in analy- 
sis, PCR-Mot = amplicon-metabarcoding data analysed with MorHuR, PCR-Map = amplicon-metabarcoding data analysed by mapping. 
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Table 3 Strength of correlation (Spearman rank correlation coefficients) between the abundance and biomass of each taxon at each site 
determined using microscopy, and the number of sequence reads generated for each taxon by the genetic techniques. Bold values show 


the genetic technique that gave the highest R? for each sample 


Amplicon metabarcoding 


Amplicon metabarcoding 


Gene-enrichment Gene-enrichment 

















MOTHUR R? Mapped R? All R? COI R2 
Abundance Biomass Abundance Biomass Abundance Biomass Abundance Biomass 
RS4 0.938 0.143 0.936 0.165 0.097 0.825 0.122 0.851 
RS10 0.333 0.609 0.389 0.711 0.494 0.743 0.460 0.620 
RS11 0.000 0.032 0.000 0.031 0.030 0.675 0.039 0.483 
RS12 0.004 0.074 0.003 0.073 0.002 0.364 0.003 0.073 
RS13 0.384 0.495 0.393 0.503 0.932 0.965 0.946 0.969 
RS14 0.821 0.480 0.829 0.481 0.504 0.865 0.539 0.882 
RS15 0.008 0.002 0.128 0.120 0.467 0.977 0.494 0.785 
RS16 0.011 0.004 0.011 0.005 0.513 0.410 0.348 0.187 
RS17 0.001 0.003 0.001 0.003 0.067 0.333 0.073 0.360 
RS18 0.054 0.105 0.009 0.028 0.143 0.417 0.128 0.362 
RS19 0.033 0.013 0.044 0.018 0.607 0.697 0.578 0.679 
RS20 0.005 0.001 0.005 0.000 0.828 0.672 0.798 0.624 
COI = cytochrome C oxidase subunit I. 
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a e or 
v o ; 
z z ‘Oo 
-+ i ' ' i 
® o o o i gal z ee 
E B- o 
kej ~ 
6 m i? 
i= A © t 
O o 
Q 
e H 
— 
a ; : 
° _ o ° -æ [o] o 
v I AA Y CA v IA v IA L A yv EA vy o A v I A 
2iLFzoLo GaS eke =3 0O50 B88 F € 
34 989 $32 233 43858 232 fgg 
OWO uw Q wW t 1 t I = l wom w o W a ow ry : t 
ia OF 9° SES FaG 9 96°F fog go 


Fig. 3 Proportion of morphologically identified taxa in each sample detected by the four genetic approaches. Taxa are binned in three 
groups (<1%, 1-5%, and more than 5% of the community) based on (a), abundance of individuals and (b) biomass of individuals. Solid 
black lines shows medians, boxes shows 1st and 3rd quartiles, whiskers extend to the last data point within 1.5 times of the interquartile 
range, open circles are outliers beyond this range. GE-All = gene-enrichment technique with all genes included in analysis, GE- 
COI = gene-enrichment technique with only the cytochrome oxidase 1 gene included in analysis, PCR-Mot = amplicon-metabarcoding 
data analysed with MorHuR, PCR-Map = amplicon-metabarcoding data analysed by mapping. 


of a category’s range just into the range of the next 


category. 


Discussion 


Environmental DNA techniques, in particular those 
that utilize HTS, will provide immense benefits to 


biomonitoring (Shokralla et al. 2012; Pawlowski et al. 
2014). These techniques will enable more samples to be 
analysed, decrease processing times, improve cost effi- 
ciency and reduce reliance on taxonomic expertise 
(Wood ef al. 2013; Kelly et al. 2014). However, an in- 
depth knowledge of the benefits and caveats of each 
method is essential (Creer et al. 2010). 
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Table 4 Macroinvertebrate community indices (MCI) 
calculated using microscopy, and the four genetic approaches. 
MCI categories >119 excellent, 100-119 good, 80-99 fair and <80 
poor water quality (Stark & Maxted 2007a). Grey shading shows 
where the genetic method resulted in a different MCI category 
from that obtained using microscopy 


Molecular invertebrate community index 











Gene 

COI amplicon enrichment 
Site Microscope MOTHUR Mapped All COI 
RS4 116 95 96 110 110 
RS10 103 95 108 103 111 
RS11 121 114 107 124 122 
RS12 122 105 111 118 119 
RS13 126 120 127 126 126 
RS14 106 112 114 (22 118 
RS15 99 101 102 107 106 
RS16 141 130 144 129 139 
RS17 96 103 104 102 109 
RS18 107 100 102 107 100 
RS19 100 147 130 96 100 
RS20 120 132 122 127 124 


In this study, we compared the widely used ampli- 
con-metabarcoding technique with a gene-enrichment 
methodology that circumvented the requirement for 
PCR amplification using universal primers and thereby 
removed associated primer biases. In general, the gene- 
enrichment method detected more taxa, had stronger 
correlations with morphologically determined abun- 
dances and biomasses and was more effective at detect- 
ing rare taxa than the amplicon-metabarcoding 
technique. Large numbers of sequences were obtained 
for all samples using both methods (c. 20 000-690 000 
per sample; Table 1). The sequencing depth was nearly 
10-fold higher for the gene-enrichment method, which is 
likely due to the higher concentration of the gene-enrich- 
ment libraries sent for sequencing. The sequencing depth 
obtained for the metabarcoding samples was sizeable 
considering the low diversity in the samples (max. 20 
taxa; Table 1). Rarefaction analysis of the sequences 
obtained from the amplicon-metabarcoding method 
showed sufficient sequencing coverage for all identified 
taxa (Fig. S1, Supporting information), demonstrating 
that the under estimation of diversity or reduced abil- 
ity to detect rare species using this method was not due 
to a lack of sequencing depth (Fig. S1, Supporting 
information). 

Stronger correlations were observed between the 
numbers of reads generated using the gene enrichment 
and morphological (abundance and biomass) analysis 
than with the amplicon metabarcoding to morphology 
comparison. In most samples, the weaker correlation 
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between morphological characterization and amplicon 
metabarcoding was due to the nonamplification of some 
taxa. Our data corroborate previous research suggesting 
that the COI primers (dgLCO1490 and dgHCO2198; Gel- 
ler et al. 2013) used in this, and many other studies are 
not truly universal and that primer biases occur (Clarke 
et al. 2014; Deagle et al. 2014). In the most biased samples 
in our study, that is, RS4 and RS19, amplicon sequencing 
resulted in greater than 80% of the sequence reads 
originating from just one taxon (Fig. 2). 

Currently, Illumina sequencing platforms are limited 
to a read length of approximately 600 nucleotides. Even 
when amplicons are sequenced in both directions 
(paired-end), the c. 700 bp of the COI gene amplified 
using the dgLCO1490 and dgHCO2198 primers cannot 
be fully sequenced. In this study, we dealt with the 
incomplete reads obtained from amplicon metabarcod- 
ing by analysing the sequences from the start and end of 
COI, and removing ambiguous data from the middle 
region. In contrast, the gene-enrichment method enabled 
sequence data from across the entire COI gene to be anal- 
ysed. Longer sequences are likely to aid in more accurate 
taxonomic assignment, particularly of closely related 
species. 

A benefit of amplicon metabarcoding over gene 
enrichment is that no prior knowledge of the taxa pre- 
sent in samples is required. Extensive reference sequence 
databases are required for the gene-enrichment method 
to enable adequate probe design. For example, gene 
enrichment failed to detect flatworms (Platyhelminthes), 
whereas the PCR-Mot analysis was the only analysis to 
successfully detect the taxon (Table $2, Supporting infor- 
mation). This failure of the gene-enrichment technique 
suggests that the genetic distance between our reference 
database and the flatworms (Platyhelminthes) in our 
samples was too great for adequate probe design and 
mapping, highlighting the ongoing need for robust refer- 
ence sequence databases (Bik et al. 2012). 

Recent amplicon-metabarcoding studies have high- 
lighted the advantages of using sequences from several 
genetic regions to characterize communities and 
suggested this may address the unavailability of truly 
universal primers (De Barba et al. 2014; Deagle et al. 
2014). This approach requires multiple PCRs, each sub- 
ject to PCR biases, and is likely to be limited to two or 
three markers, depending on the sequencing depth 
required and the next-generation sequencing platform 
utilized. Gibson etal. (2014) used an alternative 
approach incorporating multiple combinations of COI 
primers to smooth coverage among different taxa. They 
successfully identified 91% of the distinguishable indi- 
viduals in a bulk environmental sample. Generation of 
sequence data from more than three regions can be 
achieved easily using gene enrichment through the 
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design and synthesis of capture probes for multiple 
genes (Blow 2009). It also enables taxon-specific COI 
probes to be designed, that is a similar approach to the 
multiple parallel amplification primers used in Gibson 
et al. (2014). In the present study, we targeted nine differ- 
ent genes, of which six were used in the final analysis. In 
general, more taxa were detected from the gene-enrich- 
ment data when all genes were included compared with 
analysing COI sequences alone (Table 2). Multigene 
approaches could also enable internal validation of 
detections (i.e. by confirming the presence of taxa 
through positive detections from several genes) and 
therefore reduce the risk of false positives due to a match 
to a single gene, nondetection or biased results. How- 
ever, in this study, the number of reference sequences 
available for genes other than COI was extremely limited 
preventing comprehensive probe designs for the taxa. 
The availability of more reference sequences will 
enhance the sensitivity and specificity of the gene-enrich- 
ment technique. 

New Zealand’s aquatic fauna is depauperate com- 
pared to other southern hemisphere countries (Joy & 
Death 2013). The species diversity identified in the sam- 
ples used in this study is typical of rivers across New 
Zealand with varying water qualities. Provided robust 
databases exist, we believe that the performance of the 
gene-enrichment technique compared to amplicon 
metabarcoding would be further enhanced in more 
diverse samples, such as those that might be encountered 
in South American or African streams. The primary rea- 
son for this is that the gene-enrichment technique 
enables taxa-specific capture probes to be utilized and 
this negates the challenges associated with designing 
universal primers. 

Metabarcoding data were analysed using two differ- 
ent bioinformatics approaches. In some samples, this 
resulted in marked differences in the number of taxa 
detected. The PCR-Mot approach identifies a sequence’s 
closest match in the reference database. Thus, a sequence 
is assigned to the lowest taxonomic level possible. In 
contrast, PCR-Map discards any sequences obtained that 
do not have close sequence homology to those in refer- 
ence databases. The difference in performance between 
these two bioinformatics approaches highlights the need 
for careful selection of analytical methods. For example, 
if the aim of a study is to characterize diversity, the PCR- 
Mot approach may be the most appropriate. Where 
definitive identification of taxa is required, for example, 
surveys targeting invasive species, and where an 
extensive sequence database exists, the more stringent 
PCR-Map approach may be more appropriate. 

Numerous studies have highlighted the benefits of 
molecular over morphological approaches for identifying 
both cryptic species and cryptic life stages (Hebert et al. 


2004; Lindeque et al. 2013). The advantages of molecular 
techniques are particularly apparent for monitoring 
macroinvertebrates in rivers, where early instar larval 
stages are often difficult to identify below the taxonomic 
level of family. Hajibabaei et al. (2011) detected nine spe- 
cies of river macroinvertebrates using 454 pyrosequenc- 
ing that had not been identified from morphological 
examination. A similar result was obtained for some of 
our samples, despite extremely careful morphological 
identification. For example, in RS15, Hydraenidae — a 
family of small aquatic beetles only 1-3 mm in length, 
was detected with both HTS methods, but not with 
microscopy. Zhou et al. (2013) suggested that the detec- 
tion of unexpected taxa may not only be from misidenti- 
fication but also from DNA in the gut of specimens, 
undetected tissue (damaged pieces, eggs, small body 
size) or environmental DNA — the impact of these data 
on biodiversity studies awaits further analysis. 

Few studies have examined the quantitative ability of 
HTS (Porazinska et al. 2010; Deagle et al. 2013). Most 
research presumes only weak correlations between 
sequence frequency and abundance and suggest that 
environmental communities are only comparable in 
terms of relative taxon abundance (ie. normalized 
sequence reads per OTU; (Bik ef al. 2012). We found 
marked variability in the correlation between morpho- 
logically determined abundance and biomass, and the 
number of sequence reads. The correlations were gener- 
ally stronger between sequence frequency generated 
using the gene-enrichment method and biomass than the 
correlations between sequence frequency generated from 
amplicon metabarcoding and biomass. Overall, correla- 
tions were not strong even for the gene-enrichment 
method, which is possibly due to differences in gene 
copies among taxa, and variability in cell numbers in the 
life stages of taxa in the samples. Incomplete reference 
databases may also accentuate weak correlations through 
misassignment of taxa. It is unlikely that methods such 
as gene enrichment will fully resolve discrepancies 
between morphological and molecular techniques with 
respect to taxa enumeration. Pre-DNA extraction steps 
such as size filtering or removing large individuals may 
reduce biases and require further research. 

The limited coverage of reference sequence databases 
is a barrier to large-scale implementation of molecular 
biomonitoring. Efforts have been made to build 
standardized barcode reference libraries for many major 
animal and plant phyla, but numerous gaps remain. 
Country or regionally focused effort may be required, for 
example New Zealand’s lotic fauna is characterized by a 
very high proportion of endemic species (Winterbourn 
1980), and our study required a substantial effort to 
extend the reference sequence database. Despite this 
effort, voucher specimens with verified sequences 


© 2015 John Wiley & Sons Ltd 


GENE ENRICHMENT FOR ENVIRONMENTAL BIOMONITORING 1251 


are still lacking for many species. Accuracy is also a 
problem, for example, even in groups with well- 
developed reference databases greater than 20% of 
entries may be incorrectly identified, and/or lack meta- 
data such as collection location (Nilsson et al. 2006). 
The performance of both amplicon-metabarcoding and 
gene-enrichment techniques in this study would have 
been improved with a more extensive database. 

Changing from a morphologically based community 
index to a genetic-based index will not simply be a mat- 
ter of using different methods for identifying specimens. 
Because of difficulties identifying the larval stages of 
macroinvertebrates to species level, most biotic indices 
require identifications to generic or higher levels (e.g. 
Stark & Maxted 2007b). However, the closest sequence 
match of a specimen may not be to its congeners, thus 
resulting in differing results for morphology and 
sequence-based indices. It is likely that genetic-based 
indices will require validation in the field before they 
can be used widely. Nevertheless, there are indications 
that molecular-based indices may be feasible. Gene 
enrichment using only the COI sequences generated the 
MCI scores closest to the morphology-based MCI 
(Table 3). Even when shifts in the MCI classification did 
occur, these were minor and would be unlikely to 
prompt a change in management practises. Differences 
were greater between amplicon metabarcoding and mor- 
phology-based MCI scores suggesting that some valida- 
tion is required before this method can be used to assess 
water quality. The MCI relies solely on presence/absence 
data and the development of molecular-based indices 
that include abundance data in an appropriate format, 
that is, relative sequence abundance, is an interesting 
avenue for further investigation. Other studies using bar- 
coding of individual macroinvertebrate taxa, as opposed 
to the approach used here where the entire community 
was homogenized and identification made post-sequen- 
cing, have highlighted the benefits of using molecular 
approaches particularly for the identification of small, 
immature and damaged specimens (Sweeney et al. 2011; 
Stein et al. 2013). Both of these studies indicate that iden- 
tifications based on DNA barcoding will improve the 
ability to undertake ecological assessments and detect 
changes in river/stream condition (Sweeney et al. 2011; 
Stein et al. 2013). 

A disadvantage of the gene-enrichment technique 
compared with amplicon- metabarcoding is the cost of 
probes and increased library preparation time. In 2015, 
the cost of synthesizing probes ranged between US$50 
and US$200 per sample depending on the size of the kit. 
Several additional laboratory preparation steps are 
required for the gene-enrichment technique, which took 
an extra three hours to process and required a 24-h 
hybridization step. Sequencing costs are comparable 
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between techniques, although we recommend sequenc- 
ing fewer samples per MiSeq run when the enrichment 
method is used to enable sufficient coverage of all genes 
(i.e. in this study six using gene enrichment, compared 
to only one using the amplicon-metabarcoding approach. 
Recent developments in ‘Capture Probe’ technologies 
may provide more cost-effective solutions. For example, 
the nCounter analysis system (Nanostring technologies, 
WA, USA) involves the hybridization of probes to target 
DNA in a similar manner to the gene-enrichment tech- 
nique used in this study. It allows single tube multiplex- 
ing of up to 800 genes or markers. Each gene/marker is 
then distinguished through digital photography of its 
colour-coded barcode, thereby removing the need for 
downstream bioinformatics analysis. 


Summary 


We compared the sensitivity of PCR-based amplicon 
metabarcoding with a PCR-free genetic enrichment 
method to characterize aquatic macroinvertebrate com- 
munities. Amplicon metabarcoding is currently cheaper, 
sample preparation is faster, and there is no requirement 
for prior knowledge of sequences of taxa present in an 
environmental sample. However, this study highlights 
the challenges of designing universal primers, particu- 
larly for the COI gene, and found that primer biases 
introduced during PCR amplification decreased the accu- 
racy of the community characterizations. In contrast, the 
gene-enrichment method does not require a PCR amplifi- 
cation step, removing primer bias. We found stronger 
correlations between gene-enrichment community char- 
acterization and morphological assessments than those 
between amplicon metabarcoding and morphology- 
based characterization. The ability of the gene-enrich- 
ment method to target multiple genetic markers simulta- 
neously offers numerous benefits, including internal 
validation of positive detections. One of most pressing 
challenges for the implementation of HTS techniques for 
biomonitoring is the need for robust and well-validated 
reference sequence databases. In this study, results from 
both HTS methods would have been improved if suffi- 
cient reference sequence data had been available for all 
taxa present in the environmental samples. 
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